


SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Redie, Dawit Kiros, Arablouei, Reza, Werner, Stefan

arXiv.org Machine Learning

Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under non-IID data the residual error can decay slowly, causing gradient mismatch and stalled progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which integrates step-ahead (SA) correction with partial error feedback (PEF). SA-PEF recovers EF when the step-ahead coefficient α = 0 and step-ahead EF (SAEF) when α = 1. For non-convex objectives and δ-contractive compressors, we establish a second-moment bound and a residual recursion that guarantee convergence to stationarity under heterogeneous data and partial client participation. To balance SAEF's rapid warm-up with EF's long-term stability, we select α near its theory-predicted optimum. Experiments across diverse architectures and datasets show that SA-PEF consistently reaches target accuracy faster than EF.

Modern large-scale machine learning increasingly relies on distributed computation, where both data and compute are spread across many devices. Federated learning (FL) enables model training in this setting without centralizing raw data, enhancing privacy and scalability under heterogeneous client distributions (McMahan et al., 2017; Kairouz et al., 2021). In each synchronous FL round, the server broadcasts the current global model to a subset of clients. These clients perform several steps of stochastic gradient descent (SGD) on their local data and return updates to the server, which aggregates them to form the next global iterate (Huang et al., 2022; Wang & Ji, 2022; Li et al., 2024). Although FL leverages rich distributed data, it faces two key challenges.
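The abstract does not spell out the SA-PEF update rule, but the EF baseline it generalizes (the α = 0 case) is standard. A minimal client-side sketch with a δ-contractive top-k compressor, assuming the usual "compress gradient plus carried residual" form (the helper names `top_k` and `ef_step` are ours):

```python
import numpy as np

def top_k(v, k):
    """Top-k sparsification: a delta-contractive compressor with delta = k/d."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]  # indices of the k largest-magnitude entries
    out[idx] = v[idx]
    return out

def ef_step(grad, error, k, lr):
    """One client-side error-feedback step: compress the update plus the carried
    residual, transmit the compressed part, and keep the leftover for next round."""
    p = lr * grad + error   # error-corrected update
    c = top_k(p, k)         # what is actually communicated to the server
    new_error = p - c       # residual carried into the next round
    return c, new_error
```

By construction `c + new_error == p`, so nothing is discarded permanently; SA-PEF's step-ahead correction and partial residual retention modify this loop, with α interpolating between EF and SAEF.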


SEGA: Variance Reduction via Gradient Sketching

Filip Hanzely, Konstantin Mishchenko, Peter Richtarik

Neural Information Processing Systems

We propose a randomized first-order optimization method, SEGA (SkEtched GrAdient), which progressively throughout its iterations builds a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient. In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of the true gradient through a random relaxation procedure. This unbiased estimate is then used to perform a gradient step. Unlike standard subspace descent methods, such as coordinate descent, SEGA can be used for optimization problems with a non-separable proximal term. We provide a general convergence analysis and prove linear convergence for strongly convex objectives. In the special case of coordinate sketches, SEGA can be enhanced with various techniques such as importance sampling, minibatching and acceleration, and its rate is up to a small constant factor identical to the best-known rate of coordinate descent.
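For the coordinate-sketch special case, the loop described above can be sketched in a few lines: each iteration observes one partial derivative, forms an unbiased full-gradient estimate from it and the running estimate `h`, then projects the new measurement into `h`. This is an illustrative toy run on a diagonal quadratic (step size and problem are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = np.diag(np.arange(1.0, d + 1.0))  # f(x) = 0.5 * x^T A x, strongly convex
grad = lambda x: A @ x

x = rng.standard_normal(d)
h = np.zeros(d)                        # running estimate of the gradient
lr = 0.02
for _ in range(3000):
    i = rng.integers(d)                # coordinate sketch: one partial derivative
    gi = grad(x)[i]                    # the only gradient information observed
    g = h.copy()
    g[i] += d * (gi - h[i])            # unbiased estimate of the full gradient
    x = x - lr * g                     # gradient step with the unbiased estimate
    h[i] = gi                          # sketch-and-project update of h
```

Averaging `g` over the coordinate choice `i` gives `h + (grad(x) - h) = grad(x)`, which is the unbiasedness property the abstract refers to.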


Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction

Tyurin, Alexander

arXiv.org Artificial Intelligence

We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{s}$ and $\tau_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta = f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers $\tau_{s}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{s} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new "worst-case" function and develop a new lower-bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneity assumption.
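The scaling claim can be made concrete by evaluating the two runtime terms from the abstract for illustrative constants (all numeric values below are arbitrary choices of ours):

```python
def runtime_terms(n, h, sigma2, L, Delta, eps, tau_s, d):
    """Back-of-the-envelope evaluation of the abstract's runtime terms."""
    variance_term = h * sigma2 * L * Delta / (n * eps ** 2)  # improves with n
    broadcast_term = tau_s * d * L * Delta / eps             # independent of n
    return variance_term, broadcast_term

# The variance term shrinks as 1/n; the server-side broadcast term does not.
for n in (1, 10, 100):
    v, b = runtime_terms(n, h=1.0, sigma2=4.0, L=1.0, Delta=1.0,
                         eps=0.1, tau_s=0.01, d=1000)
    print(n, v, b)
```

The lower bound says no unbiased-sparsification method can make both terms shrink with $n$ (beyond poly-logarithmic factors) once server-to-worker communication is accounted for.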


Provable Reduction in Communication Rounds for Non-Smooth Convex Federated Learning

Palenzuela, Karlo, Dadras, Ali, Yurtsever, Alp, Löfstedt, Tommy

arXiv.org Artificial Intelligence

Multiple local steps are key to communication-efficient federated learning. However, for general non-smooth convex problems, theoretical guarantees for such algorithms without assumptions bounding data heterogeneity have been lacking. A typical FL algorithm consists of two main phases: local training and aggregation. Scaffold (Karimireddy et al., 2020) and Scaffnew (Mishchenko et al., 2022) stand out as notable examples. We explore the following natural question in this work: can multiple local steps provably reduce communication rounds in the non-smooth convex setting? The authors contributed equally to this work.
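As background, the local-training-plus-aggregation template described above can be sketched as a generic FedAvg-style round (this is not Scaffold or Scaffnew, which additionally maintain control variates; the function name is ours):

```python
import numpy as np

def local_sgd_round(x_global, client_grads, local_steps, lr):
    """One round of local training and aggregation: each client runs several
    (sub)gradient steps from the shared model; the server averages the results."""
    updated = []
    for grad_fn in client_grads:
        x = x_global.copy()
        for _ in range(local_steps):
            x = x - lr * grad_fn(x)  # the "multiple local steps"
        updated.append(x)
    return np.mean(updated, axis=0)  # aggregation: one communication round
```

The question posed by the paper is whether raising `local_steps` provably lowers the number of such communication rounds when the clients' objectives are non-smooth and convex.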


Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates

Kovalev, Dmitry, Mishchenko, Konstantin, Richtárik, Peter

arXiv.org Machine Learning

We present two new, remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions. The first is a stochastic variant of Newton's method (SN), and the second is a stochastic variant of cubically regularized Newton's method (SCN). We establish local linear-quadratic convergence results. Unlike existing stochastic variants of second-order methods, which require the evaluation of a large number of gradients and/or Hessians in each iteration to guarantee convergence, our methods do not have this shortcoming. For instance, the simplest variants of our methods need to compute, in each iteration, the gradient and Hessian of a {\em single} randomly selected function only. In contrast to most existing stochastic Newton and quasi-Newton methods, our approach guarantees local convergence faster than is possible with a first-order oracle and adapts to the problem's curvature. Interestingly, our methods are not unbiased, so our theory provides new intuition for designing new stochastic methods.
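A one-dimensional sketch of our reading of the single-sample idea: keep a snapshot point per function, minimize the average of the local quadratic models, and refresh only one randomly selected snapshot per iteration (so only one gradient/Hessian pair is recomputed). This is an illustrative reconstruction, not the paper's exact pseudocode:

```python
import math, random

def stochastic_newton(grads, hesses, x0, iters, seed=0):
    """Single-sample stochastic Newton sketch: w[i] stores the point at which
    function i's gradient and Hessian were last evaluated."""
    rng = random.Random(seed)
    n = len(grads)
    w = [x0] * n
    x = x0
    for _ in range(iters):
        H = sum(hesses[i](w[i]) for i in range(n)) / n
        b = sum(hesses[i](w[i]) * w[i] - grads[i](w[i]) for i in range(n)) / n
        x = b / H                 # minimizer of the averaged quadratic model
        j = rng.randrange(n)
        w[j] = x                  # refresh a single snapshot per iteration
    return x

# Example: strongly convex f_i(x) = cosh(x - b_i) with symmetric shifts,
# so the minimizer of the average is exactly 0.
shifts = (-1.0, 0.0, 1.0)
gs = [lambda x, b=b: math.sinh(x - b) for b in shifts]
hs = [lambda x, b=b: math.cosh(x - b) for b in shifts]
x_star = stochastic_newton(gs, hs, 0.5, 200)
```

The bias mentioned in the abstract is visible here: the averaged model mixes gradients and Hessians evaluated at different stale points `w[i]`.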


Gradient Descent with Compressed Iterates

Khaled, Ahmed, Richtárik, Peter

arXiv.org Machine Learning

We propose and analyze a new type of stochastic first-order method: gradient descent with compressed iterates (GDCI). In each iteration, GDCI first compresses the current iterate using a lossy randomized compression technique and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed by a mobile device before it is sent back to a server for aggregation. Our analysis provides a step towards closing the gap between the theory and practice of federated learning and opens the possibility of many extensions.
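One natural reading of the update ("compress the iterate, then take a gradient step") with an unbiased random-sparsification compressor can be sketched as follows; the exact update and step-size rules are in the paper, and the function names here are ours. Because the gradient is evaluated at a randomly perturbed point, one should expect convergence only to a neighborhood of the solution whose size shrinks with the compression variance:

```python
import numpy as np

def rand_sparsify(rng, x, p):
    """Unbiased random sparsification: keep each entry with prob. p, scale by 1/p."""
    mask = rng.random(x.shape) < p
    return np.where(mask, x / p, 0.0)

def gdci(grad, x0, lr, p, iters, seed=0):
    """Gradient descent with compressed iterates: compress first, then step."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(iters):
        xc = rand_sparsify(rng, x, p)  # lossy randomized compression of the iterate
        x = x - lr * grad(xc)          # gradient step at the compressed point
    return x
```

With `p = 1` the compressor is the identity and plain gradient descent is recovered.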


Primal Method for ERM with Flexible Mini-batching Schemes and Non-convex Losses

Csiba, Dominik, Richtárik, Peter

arXiv.org Machine Learning

In this work we develop a new algorithm for regularized empirical risk minimization. Our method extends recent techniques of Shalev-Shwartz [02/2015], which enable a dual-free analysis of SDCA, to arbitrary mini-batching schemes. Moreover, our method better utilizes the information in the data defining the ERM problem. For convex loss functions, our complexity results match those of QUARTZ, which is a primal-dual method that also allows arbitrary mini-batching schemes. The advantage of a dual-free analysis is that it guarantees convergence even for non-convex loss functions, as long as the average loss is convex. Our experiments illustrate the utility of being able to design arbitrary mini-batching schemes.
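As background, the dual-free SDCA building block of Shalev-Shwartz that the abstract extends can be sketched for the serial (single-sample) case: pseudo-dual vectors `alpha[i]` are maintained alongside the primal point `w = (1/(lam*n)) * sum(alpha)`, and the update direction is variance-reduced because its expectation equals the full regularized gradient. This is our simplified sketch, not the paper's mini-batched method:

```python
import numpy as np

def dual_free_sdca(grads, lam, dim, eta, iters, seed=0):
    """Serial dual-free SDCA sketch for min (1/n) sum phi_i(w) + (lam/2)||w||^2."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    alpha = [np.zeros(dim) for _ in range(n)]  # pseudo-dual variables
    w = np.zeros(dim)                          # invariant: w = sum(alpha)/(lam*n)
    for _ in range(iters):
        i = rng.integers(n)
        v = grads[i](w) + alpha[i]   # E_i[v] = (1/n) sum grad phi_j(w) + lam*w
        alpha[i] = alpha[i] - eta * lam * n * v
        w = w - eta * v              # keeps the primal-dual invariant exactly
    return w
```

At the fixed point, `alpha[i] = -grad phi_i(w*)` for every `i`, which is exactly the first-order optimality condition of the regularized objective.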